Exploratory Data Analysis

Let’s analyze which pokémon is the strongest. One way to choose the strongest pokémon is to rank them by the sum of their base stats, which already exists in the Total column.

datatable(pokedex[order(-pokedex$Total),c("Name", stat_name, "Catch_Rate")])

The above table shows that 11 pokémon are tied for second place, leaving Arceus as the strongest pokémon. However, the Total stat alone may not be meaningful or helpful enough for choosing only ONE pokémon for battles and raids.

Alternatively, the catch rate could also be a good indicator of strength: stronger pokémon are generally harder to catch, i.e. they have lower catch rates.

datatable(pokedex[order(pokedex$Catch_Rate, -pokedex$Total),c("Name", stat_name, "Catch_Rate")])

Using the z-score could be an effective approach to determining the strongest pokémon. The z-score tells you how many standard deviations a data point lies from the mean.
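As a minimal, self-contained sketch (toy numbers, not the pokédex), the z-score formula used below matches what base R's scale() computes:

```r
# Toy vector of base stats (illustrative values only)
x <- c(45, 60, 80, 39, 58, 78)

# Manual z-score: distance from the mean in standard deviations
z_manual <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same centering and scaling
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # TRUE
```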

First, create a new table with the z-scores of the relevant stat columns. Then sum each row into a new column named Strength, which will help choose the best pokémon.

# Use lapply() to calculate the z-score of each stat variable (including catch rate)
z_stats <- as.data.frame(lapply(pokedex[, c(6:11, 22)], function(x) (x - mean(x)) / sd(x)))
# Summing all values into a new "total" column naming it Strength
z_stats$Strength <- rowSums(z_stats)
z_stats <- cbind(pokedex[,c(1:2,13)],z_stats)

head(arrange(z_stats,desc(Strength)), n = 3)
##   Number    Name isLegendary       HP   Attack   Defense   Sp_Atk   Sp_Def
## 1    493  Arceus        True 1.997038 1.552077 1.6790846 1.780677 1.876999
## 2    716 Xerneas        True 2.229161 1.931590 0.8257421 2.162780 1.062662
## 3    717 Yveltal        True 2.229161 1.931590 0.8257421 2.162780 1.062662
##      Speed Catch_Rate Strength
## 1 1.990097 -1.2699806 9.605991
## 2 1.220244 -0.7214881 8.710692
## 3 1.220244 -0.7214881 8.710692
datatable(z_stats[order(-z_stats$Strength),])

Arceus seems like the obvious choice, but its catch rate is 3, which means it is very difficult to catch. After all, it is known as the god of pokémon. A strong non-legendary alternative is Slaking.

Distribution of Stats Variables

temp <- pokedex %>% 
  gather(key = key, value = value, Total:Speed) %>% 
  mutate(key = as.factor(key))

levels(temp$key) <- sort(stat_name)
temp$key <- factor(temp$key, levels = stat_name)

p1 <- temp %>% 
  filter(key == 'Total') %>% 
  ggplot(aes(value)) + 
  geom_histogram(color = theme_colors[2], fill = theme_colors[4]) + 
  labs(x = NULL, y = NULL, 
       title = "Summary of Distribution", 
       subtitle = "Total stat distribution is the only bimodal distribution.") + 
  facet_wrap(~ key)

p2 <- temp %>% 
  filter(key != 'Total') %>% 
  ggplot(aes(value)) + 
  geom_histogram(color = theme_colors[2], fill = theme_colors[4]) + 
  labs(x = NULL, y = NULL) + 
  facet_wrap(~ key)

grid.arrange(p1, p2, layout_matrix = matrix(c(1, 1, 2, 2, 2)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Understanding Legendary Pokemon

Population of Legendary Pokemon

pie <- pokedex %>%
  group_by(isLegendary) %>%
  summarise(Pokemon = n_distinct(Number))

labels <- c("Not Legendary", "Legendary")

plot_ly(pie, labels = labels, 
        values = ~ Pokemon, 
        type = 'pie', 
        textinfo='percent', 
        title ='Population for Legendary vs Non-Legendary Pokemon',
        showlegend = T)

Stats of Legendary vs Non-Legendary Pokemon

p3 <- temp %>% 
  filter(key == 'Total') %>% 
  ggplot(aes(value, fill = isLegendary, color = isLegendary)) + 
  geom_density(alpha = 0.75) + 
  scale_fill_manual(values = theme_colors[1:2]) + 
  scale_color_manual(values = theme_colors[1:2]) + 
  labs(x = NULL, y = NULL, 
       title = 'Non-Legendary Pokemon Have Lower Stats \nThan Legendary Pokemon', 
       subtitle = 'A few non-legendary pokemon have higher stats than legendary pokemon') + 
  facet_wrap(~ key)

p4 <- temp %>% 
  filter(key != 'Total') %>% 
  ggplot(aes(value, fill = isLegendary, color = isLegendary)) + 
  geom_density(alpha = 0.75) + 
  scale_fill_manual(values = theme_colors[1:2]) + 
  scale_color_manual(values = theme_colors[1:2]) + 
  guides(color = "none", fill = "none") + 
  labs(x = NULL, y = NULL) + 
  facet_wrap(~ key)

grid.arrange(p3, p4, layout_matrix = matrix(c(1, 1, 2, 2, 2)))

Relationship between Defense and Attack Stats

pokedex %>% 
  ggplot(aes(Attack, Defense, color = isLegendary)) + 
  geom_point(size = 5) + 
  scale_color_manual(values = theme_colors[5:6]) + 
  coord_flip() + 
  labs(x = 'Attack', y = 'Defense', 
       title = 'Relationship between Defense and Attack Stats', 
       subtitle = 'Legendary Pokemon have higher Defense and Attack stats')

Relationship between Special Defense and Special Attack Stats

pokedex %>% 
  ggplot(aes(Sp_Atk, Sp_Def, color = isLegendary)) + 
  geom_point(size = 5) + 
  scale_color_manual(values = theme_colors[5:6]) + 
  coord_flip() + 
  labs(x = 'Special Attack', y = 'Special Defense', 
       title = 'Relationship between Special Defense and Special Attack Stats', 
       subtitle = 'Legendary Pokemon have higher Special Defense and Special Attack stats')

Relationship between Height and Weight Stats

pokedex %>% 
  ggplot(aes(Height_m, Weight_kg, color = isLegendary)) + 
  geom_point(size = 5) + 
  scale_color_manual(values = theme_colors[5:6]) + 
  coord_flip() + 
  labs(x = 'Height (m)', y = 'Weight (kg)', 
       title = 'Relationship between Height and Weight Stats', 
       subtitle = 'Legendary Pokemon are heavier and taller')

Variables Correlation

corr <- round(cor(pokedex[,c(stat_name)]), 1)
ggcorrplot(corr, method = 'circle')

Defense and Speed have the lowest correlation.


Model Algorithms

Trying to predict whether a pokémon is legendary or not is a classification problem. The top 3 candidate models are:

1. Logistic Regression
   Pros: Simplest classification algorithm and the easiest to implement
   Cons: Assumes a linear relationship between the independent variables and the log-odds of the outcome; sensitive to outliers

2. Decision Tree
   Pros: Does not assume normally distributed data; requires less effort for data preparation
   Cons: Prone to overfitting

3. Random Forest
   Pros: Great predictive performance (a Random Forest is an ensemble of Decision Trees); informative output including variable importance; built-in resampling gives more reliable accuracy estimates
   Cons: Unlike a single Decision Tree, a Random Forest is a black-box model

I will be building a Random Forest model since it is the most robust of the three.
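For contrast with the chosen Random Forest, here is a minimal sketch of the logistic regression option on hypothetical toy data (the data frame and its column names are made up purely for illustration):

```r
set.seed(1)
# Hypothetical toy data: one stat roughly separating the two classes
toy <- data.frame(
  Total = c(rnorm(50, mean = 400, sd = 50), rnorm(50, mean = 650, sd = 50)),
  isLegendary = factor(rep(c("0", "1"), each = 50))
)

# Logistic regression with base R's glm()
fit <- glm(isLegendary ~ Total, data = toy, family = binomial)

# In-sample accuracy of a 0.5-threshold classifier
pred <- ifelse(predict(fit, type = "response") > 0.5, "1", "0")
mean(pred == toy$isLegendary)
```

With well-separated classes like these, the single-predictor model already classifies almost perfectly, which is why logistic regression is a sensible baseline before reaching for an ensemble.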


Feature Engineering

Data Summary

After importing the data (aka the pokédex), I noticed some blank values for Type 2 and Egg Group 2. I replaced the blanks with a “None” category.

# Select the required columns. Convert required columns to correct data type

pokedex <- subset(pokedex, select = -Catch_Rate) 
pokedex$isLegendary <- as.factor(recode(pokedex$isLegendary, True = 1, False = 0))
pokedex$Generation <- as.factor(pokedex$Generation)

levels(pokedex$Type_2)[match("",levels(pokedex$Type_2))] <- "None"
levels(pokedex$Egg_Group_2)[match("",levels(pokedex$Egg_Group_2))] <- "None"

str(pokedex)
## 'data.frame':    721 obs. of  22 variables:
##  $ Number          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name            : Factor w/ 721 levels "Abomasnow","Abra",..: 68 295 670 87 88 86 595 688 54 82 ...
##  $ Type_1          : Factor w/ 18 levels "Bug","Dark","Dragon",..: 10 10 10 7 7 7 18 18 18 1 ...
##  $ Type_2          : Factor w/ 19 levels "None","Bug","Dark",..: 15 15 15 1 1 9 1 1 1 1 ...
##  $ Total           : int  318 405 525 309 405 534 314 405 530 195 ...
##  $ HP              : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ Attack          : int  49 62 82 52 64 84 48 63 83 30 ...
##  $ Defense         : int  49 63 83 43 58 78 65 80 100 35 ...
##  $ Sp_Atk          : int  65 80 100 60 80 109 50 65 85 20 ...
##  $ Sp_Def          : int  65 80 100 50 65 85 64 80 105 20 ...
##  $ Speed           : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ Generation      : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ isLegendary     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Color           : Factor w/ 10 levels "Black","Blue",..: 4 4 4 8 8 8 2 2 2 4 ...
##  $ hasGender       : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Pr_Male         : num  0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.875 0.5 ...
##  $ Egg_Group_1     : Factor w/ 15 levels "Amorphous","Bug",..: 11 11 11 11 11 11 11 11 11 2 ...
##  $ Egg_Group_2     : Factor w/ 14 levels "None","Amorphous",..: 8 8 8 4 4 4 12 12 12 1 ...
##  $ hasMegaEvolution: Factor w/ 2 levels "False","True": 1 1 2 1 1 2 1 1 2 1 ...
##  $ Height_m        : num  0.71 0.99 2.01 0.61 1.09 1.7 0.51 0.99 1.6 0.3 ...
##  $ Weight_kg       : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ Body_Style      : Factor w/ 14 levels "bipedal_tailed",..: 10 10 10 1 1 1 1 1 1 8 ...

Partition the pokédex into train and test sets using a 70/30 split with random stratified sampling.

set.seed(355)
split <- createDataPartition(pokedex$isLegendary, p = 0.7, list = FALSE)
pokedex_train <- pokedex[split,]
pokedex_test <- pokedex[-split,]

Data Imputation

From the pokédex, NA values are only present in the Pr_Male column, with 77 missing values. There are multiple ways to handle missing values:

1. Delete the rows with NA values. This is easy when there are plenty of records, which is not the case here.
2. Delete the variable. However, it might be a significant feature for prediction.
3. Impute the missing values with the mean/median/mode of the column. A simple and basic technique.
4. Predict the missing values using kNN or bagImpute.
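Option 3 can be sketched in a line of base R (hypothetical values, shown only to illustrate median imputation):

```r
# Hypothetical Pr_Male column with missing values
df <- data.frame(Pr_Male = c(0.875, 0.5, NA, 0.25, NA))

# Replace each NA with the median of the observed values
df$Pr_Male[is.na(df$Pr_Male)] <- median(df$Pr_Male, na.rm = TRUE)

anyNA(df$Pr_Male)  # FALSE
```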

Another thing to consider is that Pr_Male is the probability of a pokémon being male, with 1 - Pr_Male being the probability of it being female. Genderless pokémon such as Magneton or Mewtwo are neither, which effectively puts them in a third category.
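That relationship can be expressed directly, with NA naturally marking the genderless third category (toy values for illustration):

```r
# Toy Pr_Male values; NA marks a genderless pokémon such as Mewtwo
pr_male <- c(0.875, 0.5, NA)

# Probability of being female is the complement; NA propagates unchanged
pr_female <- 1 - pr_male
pr_female
```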

colSums(is.na(pokedex))
##           Number             Name           Type_1           Type_2 
##                0                0                0                0 
##            Total               HP           Attack          Defense 
##                0                0                0                0 
##           Sp_Atk           Sp_Def            Speed       Generation 
##                0                0                0                0 
##      isLegendary            Color        hasGender          Pr_Male 
##                0                0                0               77 
##      Egg_Group_1      Egg_Group_2 hasMegaEvolution         Height_m 
##                0                0                0                0 
##        Weight_kg       Body_Style 
##                0                0

For the sake of simplicity and time, I’ll impute the missing values with the kNN prediction.

set.seed(355)
preproc <- preProcess(pokedex_train, method=c("knnImpute"))
preproc
## Created from 451 samples and 22 variables
## 
## Pre-processing:
##   - centered (11)
##   - ignored (11)
##   - 5 nearest neighbor imputation (11)
##   - scaled (11)

Using this preproc model to impute the missing values:

set.seed(355)
newpokedex_train <- predict(preproc, newdata=pokedex_train)

# Check for any NA values in the data frame
anyNA(newpokedex_train)
## [1] FALSE
head(newpokedex_train)
##      Number       Name Type_1 Type_2      Total         HP     Attack
## 1 -1.727436  Bulbasaur  Grass Poison -0.9187183 -0.8828777 -0.8810376
## 2 -1.722653    Ivysaur  Grass Poison -0.1222823 -0.3182942 -0.4312537
## 3 -1.717869   Venusaur  Grass Poison  0.9762501  0.4344838  0.2607215
## 4 -1.713086 Charmander   Fire   None -1.0011082 -1.1087110 -0.7772414
## 5 -1.708302 Charmeleon   Fire   None -0.1222823 -0.3935720 -0.3620562
## 6 -1.703519  Charizard   Fire Flying  1.0586400  0.3592060  0.3299190
##      Defense     Sp_Atk     Sp_Def       Speed Generation isLegendary Color
## 1 -0.7472619 -0.1211084 -0.2064367 -0.74283365          1           0 Green
## 2 -0.2806185  0.4133175  0.3351654 -0.18294215          1           0 Green
## 3  0.3860151  1.1258852  1.0573015  0.56357985          1           0 Green
## 4 -0.9472520 -0.2992503 -0.7480389  0.00368835          1           0   Red
## 5 -0.4472768  0.4133175 -0.2064367  0.56357985          1           0   Red
## 6  0.2193567  1.4465407  0.5156994  1.31010185          1           0   Red
##   hasGender  Pr_Male Egg_Group_1 Egg_Group_2 hasMegaEvolution    Height_m
## 1      True 1.528184     Monster       Grass            False -0.45268600
## 2      True 1.528184     Monster       Grass            False -0.14747732
## 3      True 1.528184     Monster       Grass             True  0.96435432
## 4      True 1.528184     Monster      Dragon            False -0.56168911
## 5      True 1.528184     Monster      Dragon            False -0.03847422
## 6      True 1.528184     Monster      Dragon             True  0.62644470
##    Weight_kg     Body_Style
## 1 -0.5532930      quadruped
## 2 -0.4858694      quadruped
## 3  0.4757469      quadruped
## 4 -0.5356082 bipedal_tailed
## 5 -0.4195510 bipedal_tailed
## 6  0.3707428 bipedal_tailed

One-Hot Encoding

Apply one-hot encoding to transform the categorical variables into numeric indicator columns. Save the pokémon number, name, and target variable to x; these columns will be re-attached to the data frame after the encoding.

x <- c("Number","Name","isLegendary")
saved_columns_train <- pokedex_train[,(names(pokedex_train) %in% x)]
newpokedex_train <- newpokedex_train[,!(names(pokedex_train) %in% x)]

dummy <- dummyVars(~., data = newpokedex_train)
newpokedex_train <- data.frame(predict(dummy, newdata = newpokedex_train))

Normalize Predictors

Normalize the training pokédex so that all variables lie between 0 and 1.

rangeModel <- preProcess(newpokedex_train, method = "range")
newpokedex_train <- predict(rangeModel, newdata = newpokedex_train)
pokedex_train <- cbind(saved_columns_train,newpokedex_train)

Transforming Pokédex Test data

Transform the test data using the same three procedures, reusing the objects fitted on the training data:

1. Imputation of missing values using the kNN preProcess object (preproc).
2. One-hot encoding using the dummyVars object (dummy).
3. Normalization using the range preProcess object (rangeModel).

newpokedex_test <- predict(preproc, newdata=pokedex_test)

saved_columns_test <- pokedex_test[,(names(pokedex_test) %in% x)]
newpokedex_test <- newpokedex_test[,!(names(pokedex_test) %in% x)]

newpokedex_test <- data.frame(predict(dummy, newdata = newpokedex_test))
newpokedex_test <- predict(rangeModel, newdata = newpokedex_test)

pokedex_test <- cbind(saved_columns_test,newpokedex_test)

Class Imbalance

One thing observed in the Data Exploration section is that the target class is imbalanced. Training on such data can bias the model and cause overfitting. There are two common solutions to class imbalance:

- Undersample the majority class (remove some normal pokémon records from the train data)
- Oversample the minority class (synthesize additional legendary pokémon records from existing examples in the train data)
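Oversampling, the route taken below with caret's upSample, can be sketched in base R on toy data (a hypothetical 90/10 split, purely illustrative):

```r
set.seed(2131)
# Toy imbalanced target: 90 normal vs 10 legendary
df <- data.frame(id = 1:100,
                 isLegendary = factor(rep(c("0", "1"), times = c(90, 10))))

# Resample minority rows with replacement until the classes match
minority <- df[df$isLegendary == "1", ]
extra <- minority[sample(nrow(minority), 90 - 10, replace = TRUE), ]
balanced <- rbind(df, extra)

table(balanced$isLegendary)  # 90 of each class
```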

I chose to oversample the pokedex_train data.

set.seed(2131)
pokedex_train_oversampled <- upSample(x = pokedex_train[,-3],y= pokedex_train$isLegendary, yname = "isLegendary")
pokedex_train_oversampled <- as.data.frame(pokedex_train_oversampled)
pokedex_train_oversampled <- pokedex_train_oversampled[, c(1:2,113,3:112)]

The dataset now has a 1:1 balance for predictor variable isLegendary.

rm(dummy, newpokedex_train, newpokedex_test, preproc, rangeModel, saved_columns_train, saved_columns_test, split)

Model Building and Prediction

Model Building

start_time = Sys.time()
set.seed(355)
model_rf <- train(isLegendary ~., 
                  data = na.omit(pokedex_train_oversampled), 
                  method = "rf",
                  importance = TRUE,
                  verbose = TRUE)
model_rf
## Random Forest 
## 
## 946 samples
## 112 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 946, 946, 946, 946, 946, 946, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.5977823  0.2189567
##    40   0.9935295  0.9870357
##   831   0.9912899  0.9825277
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 40.
end_time = Sys.time()
end_time - start_time
## Time difference of 15.51096 mins

Model Assessment

predict_rf <- predict(model_rf,pokedex_test)

# Test Accuracy
predict_rf <- as.factor(predict_rf)
print(confusionMatrix(predict_rf,pokedex_test$isLegendary))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 201   0
##          1   1  13
##                                           
##                Accuracy : 0.9953          
##                  95% CI : (0.9744, 0.9999)
##     No Information Rate : 0.9395          
##     P-Value [Acc > NIR] : 2.226e-05       
##                                           
##                   Kappa : 0.9605          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9950          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.9395          
##          Detection Rate : 0.9349          
##    Detection Prevalence : 0.9349          
##       Balanced Accuracy : 0.9975          
##                                           
##        'Positive' Class : 0               
## 
# Create prediction and performance objects for the random forest
prob_rf <- predict(model_rf, pokedex_test, type = "prob")
pred_rf <- prediction(prob_rf[,2], pokedex_test$isLegendary)
perf_rf <- performance(pred_rf, "tpr", "fpr")

# Plot ROC curves
plot(perf_rf, col = "red", main = "ROC curve")

Variable Importance

Note that a random forest returns two measures of variable importance:

- MeanDecreaseAccuracy: how much the model accuracy suffers if you leave out a particular variable
- MeanDecreaseGini: the degree to which a variable improves the probability of an observation being classified one way or another (i.e. ‘node purity’)

var_imp <- varImp(model_rf)
var_imp
## rf variable importance
## 
##   only 20 most important variables shown (out of 831)
## 
##                          Importance
## Egg_Group_1.Undiscovered     100.00
## Total                         90.43
## Sp_Atk                        64.37
## hasGender.False               62.79
## Attack                        59.03
## HP                            58.82
## Sp_Def                        58.08
## Number                        57.91
## hasGender.True                57.17
## Speed                         55.51
## Defense                       54.89
## Weight_kg                     53.21
## Height_m                      50.82
## Pr_Male                       50.20
## Egg_Group_2.None              48.62
## Egg_Group_1.Mineral           38.58
## Egg_Group_1.Monster           36.57
## Body_Style.two_wings          34.99
## Type_1.Dragon                 33.83
## Type_1.Bug                    32.03
# Top 20 important variables (plus Name and the target variable)
imp_var <- c("Number", "Name", "isLegendary", "Egg_Group_1.Undiscovered", "Total", "Sp_Atk", "hasGender.False", "Attack", "HP", "Sp_Def", "hasGender.True", "Speed", "Defense", "Weight_kg", "Height_m", "Pr_Male", "Egg_Group_2.None", "Egg_Group_1.Mineral", "Egg_Group_1.Monster", "Body_Style.two_wings", "Type_1.Dragon", "Type_1.Bug")

Basic Parameter Tuning

fitControl <- trainControl(## 5-fold CV repeated 5 times
                           method = "repeatedcv",
                           number = 5,
                           repeats = 5)

Refitting Model

Subsetting the train and test pokédexes to keep only the top 20 important variables.

pokedex_train_v2 <- pokedex_train_oversampled[,c(imp_var)]
pokedex_test_v2 <- pokedex_test[,c(imp_var)]
start_time = Sys.time()
set.seed(355)
model_rf_v2 <- train(isLegendary ~.,
                     data = na.omit(pokedex_train_v2),
                     method = "rf",
                     importance = TRUE,
                     trControl = fitControl,
                     verbose = TRUE)
model_rf_v2
end_time = Sys.time()
end_time - start_time
## Time difference of 8.145455 mins

Testing Refitted Model

predict_rf_v2 <- predict(model_rf_v2, pokedex_test_v2)
predict_rf_v2 <- as.factor(predict_rf_v2)
print(confusionMatrix(predict_rf_v2,pokedex_test_v2$isLegendary))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 201   0
##          1   1  13
##                                           
##                Accuracy : 0.9953          
##                  95% CI : (0.9744, 0.9999)
##     No Information Rate : 0.9395          
##     P-Value [Acc > NIR] : 2.226e-05       
##                                           
##                   Kappa : 0.9605          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9950          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.9395          
##          Detection Rate : 0.9349          
##    Detection Prevalence : 0.9349          
##       Balanced Accuracy : 0.9975          
##                                           
##        'Positive' Class : 0               
## 

Results and Conclusion

Initially, 112 predictor variables were used for the model. However, the most important variables were the Undiscovered egg group and the total stat points, so the model was refitted with only the top 20 important variables. It was also tuned with 5-fold cross-validation repeated 5 times.

The performance of the Random Forest model is strong, with 99.53% accuracy on the test set (only one misclassified pokémon out of 215).
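As a sanity check, the headline metrics can be recomputed by hand from the confusion matrix counts reported above (recall the positive class is “0”):

```r
# Counts from the confusion matrix above; positive class is "0"
tp <- 201  # actual 0, predicted 0
fn <- 1    # actual 0, predicted 1
fp <- 0    # actual 1, predicted 0
tn <- 13   # actual 1, predicted 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 214 / 215
sensitivity <- tp / (tp + fn)                   # 201 / 202
specificity <- tn / (tn + fp)                   # 13 / 13

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

These reproduce the 0.9953 accuracy, 0.9950 sensitivity, and 1.0000 specificity printed by confusionMatrix.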